Combining Linguistics with Statistics for Multiword Term Extraction: A Fruitful Association?

Authors

  • Gaël Dias
  • Sylvie Guilloré
  • Jean-Claude Bassano
  • José Gabriel Pereira Lopes
Abstract

The acquisition of multiword terms from large text collections is a fundamental issue in Information Retrieval: identifying them improves the indexing process and helps guide users in their search for information. In this paper, we present an original methodology that extracts multiword terms either (1) by exclusively considering statistical word regularities or (2) by combining word statistics with endogenously acquired linguistic information. For that purpose, we combine a new association measure, the Mutual Expectation, with a new acquisition process, LocalMaxs. On the one hand, the Mutual Expectation, based on the concept of Normalised Expectation, evaluates the degree of cohesiveness that links together all the textual units contained in an n-gram (for any n ≥ 2). On the other hand, LocalMaxs retrieves the candidate terms from the set of all scored n-grams by identifying local maxima of the association measure values. Finally, we compare the results obtained by applying the methodology to a raw Portuguese text with those obtained by combining word statistics with linguistic information endogenously acquired from the same corpus once it has been tagged.
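
The abstract names the two components but gives no formulas, so the following is only a minimal sketch under stated assumptions. It assumes that the Normalised Expectation of an n-gram is its probability divided by the average probability of the n patterns obtained by dropping one word at a time, that the Mutual Expectation is the n-gram probability multiplied by that ratio, and that LocalMaxs keeps an n-gram whose score is at least that of its sub-grams and strictly greater than that of its super-grams. The function names, the restriction to contiguous n-grams in the selection step, and the use of the token count as the single probability denominator are all illustrative choices, not the authors' implementation.

```python
from fractions import Fraction


def count_pattern(tokens, pattern):
    """Count positions where `pattern` matches; None stands for a one-word gap."""
    n = len(pattern)
    return sum(
        all(p is None or p == t for p, t in zip(pattern, tokens[i:i + n]))
        for i in range(len(tokens) - n + 1)
    )


def strip_gaps(pattern):
    """Drop leading and trailing gaps, so ('a', None) becomes ('a',)."""
    items = list(pattern)
    while items and items[0] is None:
        items.pop(0)
    while items and items[-1] is None:
        items.pop()
    return tuple(items)


def mutual_expectation(tokens, ngram):
    """Assumed definition: ME = p(ngram) * NE(ngram), where
    NE = p(ngram) / mean probability of the ngram with one word removed."""
    total = len(tokens)  # simplification: same denominator for every pattern
    p_ngram = Fraction(count_pattern(tokens, ngram), total)
    if p_ngram == 0:
        return Fraction(0)
    sub_probs = []
    for i in range(len(ngram)):
        sub = strip_gaps(ngram[:i] + (None,) + ngram[i + 1:])
        sub_probs.append(Fraction(count_pattern(tokens, sub), total))
    expectation = sum(sub_probs) / len(sub_probs)
    return p_ngram * p_ngram / expectation


def local_maxs(tokens, max_n=3):
    """Keep n-grams (2 <= n <= max_n) whose score is a local maximum.
    Simplification: only contiguous sub-grams and super-grams are compared."""
    scores = {}
    # Score one size above max_n so the largest candidates have super-grams.
    for n in range(2, max_n + 2):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if gram not in scores:
                scores[gram] = mutual_expectation(tokens, gram)
    selected = []
    for gram, score in scores.items():
        n = len(gram)
        if n > max_n:
            continue
        subs = [g for g in (gram[:-1], gram[1:]) if len(g) >= 2]
        supers = [
            g for g in scores
            if len(g) == n + 1 and (g[:-1] == gram or g[1:] == gram)
        ]
        if all(score >= scores[s] for s in subs) and all(
            score > scores[s] for s in supers
        ):
            selected.append(gram)
    return selected
```

Exact fractions are used instead of floats so that ties between an n-gram and its super-gram are decided reliably by the strict comparison. Calling `local_maxs(tokens, max_n=3)` on a tokenised text returns the selected n-grams as tuples of words.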

Similar resources

Multiword Unit Hybrid Extraction

This paper describes an original hybrid system that extracts multiword unit candidates from part-of-speech tagged corpora. While classical hybrid systems manually define local part-of-speech patterns that lead to the identification of well-known multiword units (mainly compound nouns), our solution automatically identifies relevant syntactical patterns from the corpus. Word statistics are then c...

Multi-word Term Extraction Based on New Hybrid Approach for Arabic Language

Arabic multiword terms are relevant strings of words in text documents. Once they are automatically extracted, they can be used to increase the performance of text mining applications such as categorisation, clustering, information retrieval, machine translation, and summarization. This paper introduces our proposed Multiword term extraction system based on the contextual informa...

Language Independent Automatic Acquisition of Rigid Multiword Units from Unrestricted Text Corpora

Multiword units are groups of words that occur together more often than expected by chance in sub-languages. Président de la République, Coupe du monde and Traité de Maastricht are multiword units. Unfortunately, most of the machine-readable dictionaries contain clearly insufficient information about multiword units. Therefore, their automatic extraction from corpora is an important issue not o...

Yet Another Ranking Function for Automatic Multiword Term Extraction

Term extraction is an essential task in domain knowledge acquisition. We propose two new measures to extract multiword terms from a domain-specific text. The first measure is both linguistically and statistically based. The second measure is graph-based, allowing assessment of the importance of a multiword term of a domain. Existing measures often solve some problems related (but not completely) to t...

A System for Compound Noun Multiword Expression Extraction for Hindi

Compound noun multiword expressions are important for many NLP applications like machine translation and information retrieval. This paper describes a system for Hindi compound noun multiword expressions (MWE) extraction from a given corpus. We identify major categories of compound noun MWEs, based on linguistic and psycholinguistic principles. Our extraction methods use various statistical co-...

Publication date: 2000